DOCUMENT RESUME

ED 078 023                                      TM 002 838

TITLE           Reliability and Confidence.
INSTITUTION     Psychological Corp., New York, N.Y.
PUB DATE        May 52
NOTE            6p.; Reprint
AVAILABLE FROM  Test Service Bulletin, The Psychological Corporation,
                New York, N.Y.
JOURNAL CIT     Test Service Bulletin; n44 p2-7 May 1952
EDRS PRICE      MF-$0.65 HC-$3.29
DESCRIPTORS     Bulletins; Correlation; Scores; Statistical Analysis;
                *Test Interpretation; *Test Reliability; Test Selection



ABSTRACT

Some aspects of test reliability are discussed.
Topics covered are: (1) how high should a reliability coefficient
be?; (2) two factors affecting the interpretation of reliability
coefficients: range of talent and interval between testings; (3) some
common misconceptions: reliability of speed tests, part vs. total
reliability, reliability for what group?, and test reliability vs.
scorer reliability; and (4) a practical checklist. (For related
documents, see TM 002 839-840.) (KM)



FILMED FROM BEST AVAILABLE COPY



Test Service Bulletin 

Nos. 44-46 THE PSYCHOLOGICAL CORPORATION 1952-1954
George K. Bennett, President

Published from time to time in the interest of promoting greater understanding of the principles and techniques
of mental measurement and its applications in guidance, personnel work, and clinical psychology, and for
announcing new publications of interest. Address communications to 304 East 45th Street, New York 17, N. Y.

Harold G. Seashore, Editor Jerome E. Doppelt Dorothy M. Clendenen 

Director of the Test Division Assistant Director Assistant Director 

Alexander G. Wesman James H. Ricks, Jr. Esther R. Hollis 

Associate Director of the Test Division Assistant Director Advisory Service 






of articles from earlier issues

44. Reliability and Confidence

45. Better Than Chance

46. The Correction for Guessing



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.






The contents of this Bulletin are not copyrighted; the articles may be quoted or reprinted without formality other
than the customary acknowledgment of the Test Service Bulletin of The Psychological Corporation as the source.




Test Service Bulletin 





No. 44 THE PSYCHOLOGICAL CORPORATION May, 1952



Published from time to time in the interest of promoting greater understanding of the principles and techniques
of mental measurement and its applications in guidance, personnel work, and clinical psychology, and for
announcing new publications of the Test Division. Address communications to 522 Fifth Avenue, New York 36.

Harold G. Seashore, Editor Jerome E. Doppelt

Director of the Test Division Marjorie Gelink

Alexander G. Wesman James H. Ricks, Jr.

Associate Director of the Test Division Assistant Directors



RELIABILITY AND CONFIDENCE 






THE chief purpose of testing is to permit us to arrive at judgments concerning the people being tested.
If those judgments are to have any real merit, they must be based on dependable scores, which, in turn,
must be earned on dependable tests. If our measuring instrument is unreliable, any judgments based on
it are necessarily of doubtful worth. No one would consider relying on a thermometer which gave readings
varying from 96° to 104° for persons known to have normal temperatures. Nor would any of us place
confidence in measurements of length based on an elastic ruler. While few tests are capable of yielding scores
which are as dependable as careful measurements of length obtained by use of a well-marked (and rigid!)
ruler, we seek in tests some satisfactory amount of dependability, of "rely-ability."

It is a statistical and logical fact that no test can be valid unless it is reliable; knowing the reliability of a
test in a particular situation, we know the limits beyond which validity in that situation cannot rise. Knowing
reliability, we know also how large a band of error surrounds a test score, how precisely or loosely that score
can be interpreted. In view of the importance of the concept of reliability, it is unfortunate that so many
inadequacies in the reporting and use of reliability coefficients are to be found in the literature. This article
is intended to clarify some aspects of this very fundamental characteristic of tests.
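
The two consequences of reliability named above can be put into numbers. The sketch below is offered only as an illustration: the article gives no formulas, so it assumes the usual classical test theory results, namely that a validity coefficient cannot exceed the square root of the reliability coefficient (taking the criterion as perfectly reliable) and that the standard error of measurement equals the standard deviation of scores times the square root of one minus the reliability. The reliability of .81 and the standard deviation of 10 are hypothetical values.

```python
import math


def validity_ceiling(reliability: float) -> float:
    """Upper limit on a validity coefficient, assuming classical test theory
    and a perfectly reliable criterion."""
    return math.sqrt(reliability)


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Typical size of the band of error around an obtained score."""
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical values, chosen only for illustration.
reliability = 0.81
score_sd = 10.0

ceiling = validity_ceiling(reliability)
sem = standard_error_of_measurement(score_sd, reliability)

print(f"validity cannot exceed about {ceiling:.2f}")
print(f"standard error of measurement is about {sem:.1f} score points")
```

Under these assumptions a test with a reliability of .81 could not show validity much above .90, and an obtained score would carry an error band of roughly four points either way.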

Reliability coefficients are designed to provide estimates
of the consistency or precision of measurements.
When used with psychological tests, the coefficients
may serve one or both of two purposes: (1) to
estimate the precision of the test itself as a measuring
instrument, or (2) to estimate the consistency of
the examinees' performances on the test. The second
kind of reliability obviously embraces the first. We
can have unreliable behavior by the examinee on a
relatively reliable test, but we cannot have reliable
performance on an unreliable instrument. A student
or applicant suffering a severe headache may give an
uncharacteristic performance on a well-built test; the
test may be reliable, but the subject's performance is
not typical of him. If, however, the test items are
ambiguous, the directions are unclear, or the pictures
are so poorly reproduced as to be unintelligible (if,
in short, the test materials are themselves inadequate),
the subject is prevented from performing reliably,
however propitious his mental and physical condition.

This two-fold purpose of reliability coefficients is
reflected in the several methods which have been
developed for estimating reliability. Methods which
provide estimates based on a single sitting offer
evidence as to the precision of the test itself; these
include internal consistency estimates, such as those
obtained by use of the split-half and Kuder-Richardson
techniques when the test is given only once, as well
as estimates based on immediate retesting, whether
with the same form or an equivalent one. When
a time interval of one or more days is introduced, so
that day-to-day variability in the person taking the
test is allowed to have an effect, we have evidence
concerning the stability of the trait and of the
examinee as well as of the test. It is important to
recognize whether a reliability coefficient describes
only the test, or whether it describes the stability of
the examinees' performances as well.


The Psychological Corporation believes
that tests should be bought on the basis of their
quality as measuring instruments and their appropriateness
for the user's purpose.

This article is one of a series offered to help
counselors, personnel men, psychologists, psychiatrists,
and educators to a fuller understanding
of mental measurements so that they can choose
tests more wisely and use them more effectively.
Previous issues of this Bulletin include:
No. 36 on the concept of aptitudes
No. 37 on the validation of tests
No. 38 on the use of expectancy tables
No. 39 on norms
No. 40 on correlation coefficients
Nos. 41 and 43 on the identification of children's
special abilities
No. 42 on the cost of testing

Upon request, The Psychological Corporation
will be glad to send copies of any of these
earlier Bulletins without charge.







How High Should a Reliability Coefficient Be?

We should naturally like to have as much con- 
sistency in our measuring instruments as the physicist 
and the chemist achieve. However, the complexities 
of human personality and other practical considera- 
tions often place limits on the accuracy with which 
we measure and we accept reliability coefficients of
different sizes depending on various purposes and
situations. Perhaps the most important of these
considerations is the gravity of the decision to be made
on the basis of the test score. The psychologist who 
has to recommend whether or not a person is to be 
committed to an institution is obligated to seek the 
most reliable instruments he can obtain. The counselor 
inquiring as to whether a student is likely to do better 
in one curriculum or another may settle for a slightly 
less reliable instrument, but his demands should still 
be high. A survey of parents' attitudes towards school 
practices needs only moderate reliability, since only 
the average or group figures need to be highly 
dependable and not the specific responses of indi- 
vidual parents. Test constructors experimenting with 
ideas for tests may accept rather low reliability in the 
early stages of experimentation — those tests which 
show promise can then be built up into more reliable 
instruments before publication. 

It is much like the question of how confident we 
wish to be about decisions in other areas of living. 
The industrial organization about to hire a top
executive (whose decisions may seriously affect the
entire business) will usually spend large sums of
time and money to obtain reliable evidence concerning
a candidate's qualifications for the job. The same
firm will devote far less time or money to the hiring 
of a clerk or office boy, whose errors are of lesser 
consequence. In buying a house, we want to have as 
much confidence in our decision as we can reasonably 
get. In buying a package of razor blades, slim evidence 
is sufficient since we lose little if we have to throw 
away the entire package or replace it sooner than 
expected. The principle is simply stated: the more 
important the decision to be reached, the greater is 
our need for confidence in the precision of the test 
and the higher is the required reliability coefficient. 



Two Factors Affecting the Interpretation
of Reliability Coefficients

Actually, there is no such thing as the reliability
coefficient for a test. Like validity, reliability is
specific to the group on which it is estimated. The
reliability coefficient will be higher in one situation
than in another according to circumstances which
may or may not reflect real differences in the precision
of measurement. Among these factors are the range
of ability in the group and the interval of time
between testings.

Range of talent 

If a reliability estimate is based on a group which
has a small spread in the ability measured by the
test, the coefficient will be relatively low; if the group
is one which has a wide range in that particular
talent, the coefficient will be higher. That is, the
reliability coefficient will vary with the range of
talent in the group, even though the accuracy of
measurement is unchanged. The following example
may illustrate how this comes about. For simplicity,
we have used small numbers of cases; ordinarily, far
larger groups would be required to ensure a coefficient
in which we could have confidence.

In Table I are shown the raw scores and rankings 
of twenty students on two forms of an arithmetic 
test. Looking at the two sets of rankings, we see that 
changes in rank from one form to the other are 
minor; the ranks shift a little, but not importantly. 
A coefficient computed from these data would be
fairly high. 



Table I. Raw Scores and Ranks of Students
on Two Forms of an Arithmetic Test

               Form X            Form Y
Student     Score   Rank      Score   Rank
   A          90      1         88      2
   B          87      2         89      1
   C          83      3         76      5
   D          78      4         77      4
   E          72      5         80      3
   F          70      6         65      7
   G          68      7         64      8
   H          65      8         67      6
   I          60      9         53     10
   J          54     10         57      9
   K          51     11         49     11
   L          47     12         45     14
   M          46     13         48     12
   N          43     14         47     13
   O          39     15         44     15
   P          38     16         42     16
   Q          32     17         39     17
   R          30     18         34     20
   S          29     19         37     18
   T          25     20         36     19





Now, however, let us examine only the rankings of
the five top students. Though for these five students
the shifts in rank are the same as before, the importance
of the shifts is greatly emphasized. Whereas in
the larger group student C's change in rank from
third to fifth represented only a ten per cent shift
(two places out of twenty), his shift of two places in
rank in the smaller top group is a forty per cent
change (two places out of five). When the entire
twenty represent the group on which we estimate
the reliability of the arithmetic test, going from third
on form X to fifth on form Y still leaves the student
as one of the best in this population. If, on the other
hand, reliability is being estimated only on the group
consisting of the top five students, going from third
to fifth means dropping from the middle to the
bottom of this population, a radical change. A
coefficient, if computed for just these five cases, would
be quite low.

Note that it is not the smaller number of cases
which brings about the lower coefficient. It is the
narrower range of talent which is responsible. A
coefficient based on five cases as widespread as the
twenty (e.g., pupils A, E, J, O, and T, who rank first,
fifth, tenth, fifteenth, and twentieth respectively on
form X), would be at least as large as the coefficient
based on all twenty students.
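
The three coefficients described in words above can be computed directly from Table I. The article does not say which coefficient it has in mind; the sketch below assumes the rank-difference (Spearman) coefficient, since the discussion is framed in terms of ranks, and it relies on the fact that there are no tied scores within either form. The scores are entered as read from Table I.

```python
def ranks(scores):
    """Rank scores from highest (1) to lowest; assumes no tied scores."""
    order = sorted(scores, reverse=True)
    return [order.index(s) + 1 for s in scores]


def rank_correlation(x, y):
    """Rank-difference coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


students = "ABCDEFGHIJKLMNOPQRST"
form_x = [90, 87, 83, 78, 72, 70, 68, 65, 60, 54,
          51, 47, 46, 43, 39, 38, 32, 30, 29, 25]
form_y = [88, 89, 76, 77, 80, 65, 64, 67, 53, 57,
          49, 45, 48, 47, 44, 42, 39, 34, 37, 36]


def subset(names):
    """Pull the Form X and Form Y scores for the named students."""
    idx = [students.index(name) for name in names]
    return [form_x[i] for i in idx], [form_y[i] for i in idx]


rho_all = rank_correlation(form_x, form_y)         # all twenty students
rho_top5 = rank_correlation(*subset("ABCDE"))      # narrow range: top five
rho_spread = rank_correlation(*subset("AEJOT"))    # five cases, wide range

print(round(rho_all, 2))     # 0.98
print(round(rho_top5, 2))    # 0.5
print(round(rho_spread, 2))  # 1.0
```

The same test and the same shifts in rank yield about .98 for the full group, .50 for the five top students, and 1.00 for five students spread across the whole range, which is the pattern the text describes.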

This example shows why the reliability coefficient
may vary even though the test questions and the
stability of the students' performances are unchanged.
A test may discriminate with satisfactory precision
among students with wide ranges of talent but not
discriminate equally well in a narrow range of talent.
A yardstick is unsatisfactory if we must differentiate
objects varying in length from 35.994 to 36.008 inches.
Reliability coefficients reflect this fact, which holds
regardless of the kind of reliability coefficient computed.
It should be obvious, then, that no reliability
coefficient can be properly interpreted without
information as to the spread of ability in the group on
which it is based. A reliability coefficient of .65 based
on a narrow range of talent is fully as good as a
coefficient of .90 based on a group with twice that
spread of scores. Reliability coefficients are very much
a function of the range of talent in the group.
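
The comparison of .65 with .90 can be checked with a little arithmetic. The sketch below is an illustration only and rests on two assumptions not stated in the article: that "spread" means the standard deviation of scores, and that the error of measurement is the same size in both groups, so that the reliability equals one minus the ratio of error variance to score variance. The standard deviations of 5 and 10 are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Error band implied by a reliability coefficient in a given group."""
    return sd * math.sqrt(1 - reliability)


def reliability_in_other_group(r_known, sd_known, sd_other):
    """Reliability expected in a group with a different spread of scores,
    assuming the error variance is the same in both groups."""
    error_variance = sd_known ** 2 * (1 - r_known)
    return 1 - error_variance / sd_other ** 2


narrow_sd = 5.0   # hypothetical spread in the narrow group
wide_sd = 10.0    # twice that spread

# What would .65 in the narrow group become in the wide group?
r_equivalent = reliability_in_other_group(0.65, narrow_sd, wide_sd)

# Error bands implied by the two coefficients quoted in the text.
sem_narrow = standard_error_of_measurement(narrow_sd, 0.65)
sem_wide = standard_error_of_measurement(wide_sd, 0.90)

print(f"equivalent reliability in the wide group: {r_equivalent:.4f}")
print(f"error band, narrow group at .65: {sem_narrow:.2f}")
print(f"error band, wide group at .90:   {sem_wide:.2f}")
```

Under these assumptions a coefficient of .65 in the narrow group corresponds to a little over .91 in the group with twice the spread, and the error band it implies is, if anything, slightly smaller than the one implied by .90 in the wider group.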

Interval between testings 

When two forms of a test are taken at a single
sitting, the reliability coefficient computed by correlating
the two forms is likely to overestimate somewhat
the real accuracy of the test. This is so because
factors such as mental set, physical condition of
examinees, conditions of test administration, etc.
(factors which are irrelevant to the test itself) are
likely to operate equally on both forms, thus making
each person's pair of scores more similar than they
otherwise would be. The same type of overestimate
may be expected when reliability is computed by
split-half or other internal consistency techniques,
which are based on a single test administration.
Coefficients such as these describe the accuracy of
the test, but exaggerate the practical accuracy of the
results by the extent to which the examinees and the
testing situation may normally be expected to fluctuate.
As indicated above, coefficients based on a
single sitting do not describe the stability of the
subjects' performances.

When we set out to investigate how stable the test
results are likely to be from day to day or week to
week, we are likely to underestimate the test's
accuracy, though we may succeed in obtaining a
realistic estimate of stability of the examinees'
performances on the test. The underestimation of the
test's accuracy depends on the extent to which
changes in the examinees have taken place between
testings. The same influences mentioned above
(mental set, physical condition of examinees, and the
like) which increase coefficients based on a single
sitting are likely to decrease coefficients when testing
is done on different days. It is unlikely, for example,
that the same persons who had headaches the first
day will also have headaches on the day of the second
testing.

Changes in the persons tested may also be of a
kind directly related to the content of the particular
test. If a month has elapsed between two administrations
of an arithmetic test, different pupils may have
learned different amounts of arithmetic during the
interval. The second testing should then show greater
score increases for those who learned more than for
those who learned less. The correlation coefficient
under these conditions will reflect the test's accuracy
minus the effect of differential learning; it will not
really be a reliability coefficient.

For most educational and industrial purposes, the
reliability coefficient which reflects stability of
performance over a relatively short time is the more
important. Usually, we wish to know whether the
student or job applicant would have achieved a
similar score if he had been tested on some other
day, or whether he might have shown up quite
differently. It would be unfortunate and unfair to
make important decisions on the basis of test results
which might have been quite different had the person
been tested the day before or a day later. We want
an estimate of reliability which takes into account 



ERLC 



4 



TEST SERVICE BULLETIN 



accidental changes in day-to-day ability of the
individual, but which has not been affected by real
learning between testings. Such a reliability coefficient
would be based on two sittings, separated by one or
more days so that day-to-day changes are reflected in
the scores, but not separated by so much time that
permanent changes, or learning, have occurred.* Two
forms of a test, administered a day to a week apart,
would usually satisfy these conditions. If the same
form of a test is used in both sittings, the intervening
time should be long enough to minimize the role of
memory from the first to the second administrations.

* A coefficient which is based on two testings between which
opportunity for learning has occurred is a useful statistic. It
may provide evidence of how much individual variation in
learning has taken place, or of the stability of the knowledge,
skills or aptitudes being measured. It is similar to a reliability
coefficient, and is in part a function of the reliability of the
two measurements; but such a coefficient should not be
interpreted as simply estimating reliability; it requires a more
complex interpretation.

Ideally, then, our reliability coefficient would
ordinarily be based on two different but equivalent
forms of the test, administered to a group on two
separate occasions. However, it is often not feasible
to meet these conditions: there may be only one form
of the test available, or the group may be available
for only one day, or the test may be one which is
itself a learning experience. We are then forced to
rely on coefficients based on a single administration.
Fortunately, when such coefficients are properly used
they usually provide close approximations to the
estimates which would have been obtained with alternate
forms administered at different times.



Some Common Misconceptions

Reliability of speed tests

Although estimates of reliability based on one
administration of the test are often satisfactory, there
are some circumstances in which only retest methods
are proper. Most notable is the case in which we are
dealing with an easy test given under speed conditions.
If the test is composed of items which almost
anyone can answer correctly given enough time but
which most people tested cannot finish in the time
allowed, the test is largely a measure of speed. Many
clerical and simple arithmetic tests used with adults
are examples of speed tests. Internal consistency
methods, whether they are of the Kuder-Richardson
or of the split-half type, provide false and often
grossly exaggerated estimates of the reliability of
such tests. To demonstrate this problem, two forms
of a simple but speed-laden clerical test were given
to a group. For each form the odd-even (split-half)
reliability coefficient was found to be over .99. However,
when scores on Form A were correlated with
scores on Form B, the coefficient was .88. This latter
value is a more accurate estimate of the reliability of
the test.* Many equally dramatic illustrations of how
spurious an inappropriate coefficient can be may be
found readily, even in manuals for professionally
made tests.
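
Why the odd-even coefficient runs so high on a speed test can be seen in a small worked case. The sketch below is an illustration with invented figures, not the clerical test data cited above. It assumes a pure speed test on which every item attempted is answered correctly, so that an examinee who completes n items earns half of them on the odd-numbered items and half on the even-numbered items, whatever his consistency from one form to the next.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical numbers of items completed by eight examinees on two forms.
completed_a = [40, 35, 52, 47, 30, 44, 38, 50]
completed_b = [36, 41, 47, 50, 33, 39, 43, 45]

# On a pure speed test the odd and even half scores on one form can differ
# by at most one point, because both are fixed by the number completed.
odd_a = [(n + 1) // 2 for n in completed_a]
even_a = [n // 2 for n in completed_a]

r_half = pearson(odd_a, even_a)
split_half = 2 * r_half / (1 + r_half)  # Spearman-Brown step-up
alternate_form = pearson(completed_a, completed_b)

print(f"odd-even (stepped up): {split_half:.3f}")
print(f"Form A with Form B:    {alternate_form:.3f}")
```

With these figures the odd-even estimate exceeds .99 while the correlation between forms is only about .76. The split halves agree because they are two counts of the same speed on the same occasion, not because the test measures that speed dependably from one occasion to another.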

If a test is somewhat dependent on speed, but the
items range in difficulty from easy to hard, internal
consistency estimates will not be as seriously misleading
as when the test items are simple and the
test is highly speeded. As the importance of speed
diminishes, these estimates will be less different from
the coefficients which would be obtained by retest
methods. It is difficult to guess how far wrong an
inappropriate coefficient for a speeded test is. Whenever
there is evidence that speed is important in test
performance, the safest course is to insist on an estimate
of reliability based on test-and-retest, if necessary
with the same but preferably with an alternate
form of the test.

Part vs. total reliability

Some of the tests we use are composed of several
parts which are individually scored and the part
scores are then added to yield a total score. Often,
reliability is reported only for the total score, with
no information given as to the reliability of the
scores on the individual parts. This may lead to
seriously mistaken assumptions regarding the reliability
of the part scores and, thus, of the confidence
we may place in judgments based on the part
scores. The longer a test is, other things being equal,
the more reliable it is; the shorter the test, the lower
is its reliability likely to be. A part score based on
only a portion of the items in a test can hardly be
expected to be as reliable as the total score; if we
treat the part score as though it has the reliability
of the total score, we misplace our confidence,
sometimes quite seriously.

As an example, we may look at the Wechsler
Intelligence Scale for Children, one of the most important
instruments of its kind. Five subtests are combined
to yield a total Verbal Score for this test. The
reliability coefficient for the Verbal Score, based on
200 representative ten-year-olds, is .96, high enough
to warrant considerable confidence in the accuracy
of measurement for these youngsters. For the same
population, however, a single subtest (General Comprehension)
yields a reliability coefficient of only
.73, a far less impressive figure. If we allow ourselves
to act as though the total test reliability
coefficient of .96 represents the consistency of measurement
we can expect from the Comprehension
subtest, we are likely to encounter unpleasant surprises
on future retests. More importantly, any clinical
judgments which ignore the relatively poor reliability
of the part score are dangerous. Test users should
consider it a basic rule that if evidence of adequate
reliability for part scores is missing, the part scores
should not be used.

* Manual for the Differential Aptitude Tests, Revised Edition,
page 65. The Psychological Corporation, 1952.
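
The relation between length and reliability mentioned above is usually expressed by the Spearman-Brown formula, which the article does not state. The sketch below applies it to the two Wechsler figures quoted in the text, under the simplifying assumption (made here only for illustration) that the five verbal subtests are equivalent in length and content, so that one subtest is a test one-fifth as long as the whole.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is made length_factor times as long
    (a factor below 1 shortens the test)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


verbal_reliability = 0.96   # total Verbal Score, from the text
subtest_reliability = 0.73  # General Comprehension subtest, from the text
n_subtests = 5

# Reliability to expect from one part if the parts were interchangeable.
projected_part = spearman_brown(verbal_reliability, 1 / n_subtests)

# Reliability to expect from five parts as reliable as the reported subtest.
projected_total = spearman_brown(subtest_reliability, n_subtests)

print(f"one-fifth of a .96 test: about {projected_part:.2f}")
print(f"five parts of .73 each:  about {projected_total:.2f}")
```

Even under the favorable assumption of interchangeable parts, a part one-fifth as long as a .96 test would be expected to reach only about .83, and the reported .73 is lower still. Neither figure licenses treating the part score as though it carried the reliability of the total.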

Reliability for what group? 

This question may be considered as a special case
under the principles discussed above with respect to
range of talent. It is worth special consideration
because it is so often ignored. Even the best documented
of test manuals present only limited numbers
of reliability coefficients; in too many manuals a
single coefficient is all that is made available. On
what group should a reliability coefficient be based?

When we interpret an individual's test score, the
most meaningful reliability coefficient is one based
on the group with which the individual is competing.
Stated otherwise, the most appropriate group is that
in which the counselor, clinician or employment
manager is trying to make decisions as to the relative
ability of the individuals on the trait being measured.
Any one person is, of course, a member of many
groups. An applicant for a job may also be classified
as a high school or college graduate, an experienced
or inexperienced salesman or bookkeeper, a local or
out-of-state person, a member of one political party
or another, below or above age thirty, etc. A high
school student is a boy or girl; a member of an
academic, trade or commercial school group; a member
of an English class, a geometry class, or a woodworking
or cooking class; a freshman or a junior; a
future engineer or nurse or garage mechanic. Obviously,
it would be impossible for a test manual to
offer reliability for all the groups of which any one
individual is a member.

[Illustration: Each of us is a member of many groups]




The appropriate group is represented by the individual's
present competition. If we are testing applicants
for clerical work, the most meaningful reliability
coefficient is one based on applicants for clerical work.
Coefficients based on employed clerical workers are
somewhat less useful, those based on high school
graduates are still less useful; as we go on to more
general groups (e.g., all high school students or all
adults) the coefficients become less and less meaningful.
Similarly, as we go to less relevant groups
(even though they may be quite specific) the reliability
coefficients are also less relevant and less
meaningful. The reliability of a test calculated on the
basis of mechanical apprentices, college sophomores,
or junior executives reveals little of importance when
we are concerned with clerical applicants. What we
need to know is how well the test discriminates among
applicants for clerical work. If we can define the
population with even greater specificity and relevance
(e.g., female applicants for filing jobs) so much the
better. The closer the resemblance between the group
on which the reliability coefficient is based and the
group of individuals about whose relative ability we
need to decide, the more meaningful is that coefficient
of reliability.

Test reliability vs. scorer reliability 

Some tests are not entirely objective as to scoring
method; the scorer is required to make a judgment
as to the correctness or quality of the response. This
is frequently true in individually-administered tests
(Wechsler or Binet for example), projective techniques
in personality measurement (Rorschach, Sentence
Completion, etc.) and many other tests in which
the subject is asked to supply the answer, rather
than to select one of several stated choices. For tests
such as these, it is important to know the extent of
agreement between the persons who score them. Test
manuals usually report the amount of agreement by
means of a coefficient of correlation between scores
assigned to a set of test papers by two or more
independent scorers.

Such a correlation coefficient yields important
information: it tells us how objectively the test can be
scored. It even contributes some evidence of reliability,
since objectivity of scoring is a factor which
makes for test reliability. Such a coefficient should
not, however, be considered a reliability coefficient
for the test; it is only an estimate of scoring reliability,
a statement of how much confidence we may have
that two scorers will arrive at similar scores for a
given test paper. Moreover, it is possible for a test
to be quite unreliable as a measuring instrument, yet
have high scoring objectivity. We should remember
that many objective tests (those in which the person
selects one of several stated options) are not very
reliable, yet the scoring is by definition objective. A
short personality inventory may have a retest reliability
coefficient of .20; but if it is the usual paper-and-pencil
set of questions with a clear scoring key,
two scorers should agree perfectly, except for clerical
errors, in assigning scores to the test. The coefficient
of correlation between their sets of scores might well
be 1.00.

In short, information as to scorer agreement is im- 
portant but not sufficient. The crucial question— How 
precisely is the test measuring the individual?— is not 
answered by scorer agreement; a real reliability co- 
efficient is required. 

A Practical Check-list 

When reading a test manual, the test user would 
do well to apply a mental check-list to the reliability 
section, raising at least the following questions for 
each reliability coefficient: 

1. What does the coefficient measure?

a. Precision of the test: coefficient based on
single sitting?

b. Stability of examinees' test performances:
coefficient based on test-and-retest with a
few days intervening?

2. Is it more than a reliability coefficient? ...
does it also measure constancy of the trait? ... is
the coefficient based on test-and-retest with enough
intervening time for learning or similar changes to
have occurred?



3. Do scores on the test depend largely on how 
rapidly the examinees can answer the questions? If 
so, is the reliability coefficient based on a test-and- 
retest study? 

4. Are there part scores intended for consideration 
separately? If so, is each part score reliable enough to 
warrant my confidence? 

5. Is the group on which this coefficient is based 
appropriate to my purpose? Does it consist of people 
similar to those with whom I shall be using the test? 

6. Since a reliability coefficient, like any other 
statistic, requires a reasonable number of cases to be 
itself dependable, how large is the group on which 
the coefficient is based? 

If, and only if, the coefficients can be accepted as 
meeting the above standards, one may ask: 

7. In view of the importance of the judgments I 
shall make, is the correlation coefficient large enough 
to warrant my use of the test? 

A reliability coefficient is a statistic, simply a number
which summarizes a relationship. Before it takes
on meaning, its reader must understand the logic of
the study from which the coefficient was derived, the
nature of the coefficient and the forces which affect
it. Statistics may reveal or conceal; what they do
depends to a very large extent on the logical ability
and awareness the reader brings to them. Figures do
lie, to those who don't or won't understand them.

-A.G.W.



A screening end counseling aid of interest to high school and college users 

•SURVEY OF STUDY HABITS AND ATTITUDES 

WiLLUM F. Brown and Wayne H. Holtzman, University of Texas 



To counselors and educators, the study habits and 
attitudes of their students are of immense importance. 
It is in these terms that one most often seeks the ex- 
planation of the well-endowed student who earns only 
poor grades while others with mediocre scholastic 
aptitude are achieving a better record. The Brown- 
Holtzman Survey of Study Habits and Attitudes 
(SSHA) is designed to detec' cases in which this is a 
likely source of difficulty in college and to help these 
students and their advisors plan steps which may 
avert the difficulty. 



In taking the SSHA, the student indicates the frequency
with which he practices or the extent to which
he agrees with each of seventy-five study procedures
or beliefs. The scoring keys reflect systematic development
and rigorous cross-validation against actual
grades earned in ten colleges, from Amherst to
U.C.L.A. As a result, the SSHA can be used effectively
as



a) a screening device, to identify among fresh- 
men entering college those most likely to 
need early preventive help; 

b) a diagnostic instrument and counseling aid, 
by use of a special Counseling Key which 
can be laid over the student's answer sheet
to indicate specific practices or beliefs 
which may handicap him; 

c) a teaching aid, not only in remedial or
how-to-study classes but also in elementary
psychology and education courses where it 
stimulates lively discussion both of scores 
and of the statements which make up the 
Survey; and 

d) a research tool, in investigations of the 
learning or the counseling processes. 






Test Service Bulletin 



No. 45          THE PSYCHOLOGICAL CORPORATION          May, 1953

Published from time to time in the interest of promoting greater understanding of the principles and techniques
of mental measurement and its applications in guidance, personnel work, and clinical psychology, and for
announcing new publications of the Test Division. Address communications to 522 Fifth Avenue, New York 36.

Harold G. Seashore, Editor                     Jerome E. Doppelt
Director of the Test Division                  Marjorie Gelink
Alexander G. Wesman                            James H. Ricks, Jr.
Associate Director of the Test Division        Assistant Directors



BETTER THAN CHANCE

"Tests with a coefficient of validity less than .50 are practically useless, except in distinguishing
between extreme cases, since at that value of r the forecasting efficiency is only 13.4 per cent."¹ This
statement is quoted from one of the leading statistical texts; its paraphrase may be found in many
other texts, in doctoral dissertations and other treatises of greater or lesser authority. But relatively
few validity coefficients, especially in industry, exceed .50.

Why are tests being used even though they generally fall into this "practically useless" class? Is it because of
ignorance on the part of test users? Not at all. Witness the statement by the author of the above quotation, in
reviewing a test with validity coefficients averaging .35 to .55 in various institutions: "[the test] has shown
substantial value in predicting scholarship at the graduate level."² Now the forecasting efficiency (using the same
[...] commendation. The reader might justifiably be confused: if the expert can't agree
with himself, what is the counselor or personnel man to think?

Reassurance is in order. The test user may follow the practice of the expert, without violating the principle
enunciated in the texts. The "index of forecasting efficiency" as formulated in the texts is concerned with
[...] required in most practical situations. As a measure of the real utility of
[...] misleading. A more crucial consideration is the extent to which broader
judgments [...]
The difference between the two concepts can be
illustrated by the prediction, in two
different situations, of how far several men can
broadjump. If the occasion is an athletic contest, we
might want to predict just how many feet and inches
each man will cover. The average difference between
our estimated distances and the actual jumps will
[...] the better (i.e., the more valid) the basis on which
we make our predictions, the smaller this average
difference will become, and the percentage by which
it decreases is analogous to the per cent of improvement
over chance. But suppose we move from the
athletic contest or theoretical laboratory situation to
one in which the practical value [...], say
one in which it is necessary to leap across a brook.
[...] feet wet as will those who miss by six feet. And those
who just clear it will be as useful on the other side
as those who sail over with five feet to spare. Now
the test of the efficiency of our predictive test lies
in the confidence with which it permits us to say,
"of men who score like this, nine out of ten will make
it, but of those whose scores are the lowest only three
out of ten will get across." Of course, the absolute
dichotomy is as extreme in its way as the pinpoint
precision estimate is at the other extreme. But when
we are trying to guess in which general category,
high, middle or low (the champions, the experts,
the good, the just average, or the duffers), our candidates
will fall, we're closer to the second situation
than the first. Most counselors, personnel men, and
clinicians have to work with these cruder approximations.

References denoted by superscript numbers will be found at the end of this article.

Per cent of improvement over chance, as used with
the index of forecasting efficiency, refers to the narrowing
of a zone of error around a predicted score.
When the validity coefficient is zero, knowledge of
a test score does not permit us to predict an individual's
score on the criterion with any accuracy at all; the
best guess we can make with respect to any individual,
regardless of how he scored on such a test, is that he
will be average on the criterion. The band of error
(the standard error of estimate) is as large as the
spread (the standard deviation) of the ratings on the
criterion for the entire group. As the correlation between
the test scores and the criterion ratings increases,
our precision in predicting ratings of individuals on
the criterion also increases and we may predict with
some degree of confidence, for example, that a person
who scores in the top quarter on the test will be rated
in the top quarter on the criterion as well. Of course,
some of our predictions will be in error: i.e., some of
those whose scores are in the top quarter on the test
will be rated in the second quarter on performance, a
smaller number in the third quarter, and a few may
even be rated in the lowest quarter. The larger the
validity coefficient, the fewer misplaced persons there
will be; furthermore, the smaller will be the amount of
displacement. In other words, if the validity coefficient
is really high, we may expect most of those who
score in the top quarter on the test to be rated in the
top quarter on performance as well, a very few to be
rated in the second quarter, and fewer still (or perhaps
even none at all) to be rated in the third or fourth
quarters.

The number of persons for whom statistically calculated
predictions are wrong, and the amount by
which our estimates are in error, are reflected in the
standard error of estimate. When validity is perfect,
the standard error of estimate is zero; when validity
is zero, the standard error of estimate is at its maximum.
As the validity increases, the standard error
of estimate decreases. The degree to which the standard
error of estimate is reduced is what is meant by
the textbook statements concerning improvement over
chance. In this sense, large validity coefficients are
necessary: it takes an r = .866 to cut the standard
error of estimate even to half the size of the standard
deviation of the criterion ratings, a "fifty per cent
improvement over chance."
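The textbook figures quoted so far can be checked with a few lines of arithmetic. The sketch below uses the formula given in the first reference at the end of the article, index = 100(1 - sqrt(1 - r^2)); the function name is only illustrative and the small-sample adjustment mentioned in that reference is not applied.

```python
from math import sqrt


def forecasting_efficiency(r: float) -> float:
    """Per cent by which the standard error of estimate falls below
    the standard deviation of the criterion, for validity coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return 100.0 * (1.0 - sqrt(1.0 - r * r))


# Validity coefficients mentioned in the article.
for r in (0.30, 0.50, 0.61, 0.866):
    print(f"r = {r:.3f}: {forecasting_efficiency(r):5.1f} per cent better than chance")
```

Under this formula a coefficient of .50 gives about 13.4 per cent and .866 gives about 50 per cent, which is why the index makes moderate coefficients look so unpromising.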

What permits us to use tests effectively even though
their validity coefficients are considerably lower than
.866? First, there is the matter of precision. The standard
error of estimate refers to the band of error
around predictions of precise, specific rankings of each
individual on the criterion. In most practical work,
such precision is unnecessary. We do not ordinarily
need to predict that John Jones will be exactly at the
85th percentile in a college class, or that Bill Smith will
be 19th in a group of 25 engineering apprentices. We
are far more likely to be concerned with whether
Jones will survive the first year in college, or whether
Smith will be one of the satisfactory apprentices.
For these purposes, whether Jones is at the 75th percentile
or 90th percentile is of lesser moment; we can
make a quite confident prediction that he will succeed,
even though there may be a fair-sized standard error
of estimate applicable to the specific percentile our
formula predicts.

A second factor working in our favor in the practical
use of tests is that, as the opening quotation
notes, predictions are most accurately made at the
extremes, and it is the extremes that are of greatest
interest to us. Few colleges grant large scholarships
to more than 10 or 20 per cent of their students. Few
colleges fail as many as half their students and few
industrial firms fire as many as half of those they hire.
More often, the failures are 10 per cent or 20 per cent
or possibly 30 per cent: the extremes. Thus a test
which does not predict with accuracy whether students
will be at the 40th percentile or the 60th percentile
can still do a valuable service in predicting that very
few of the high scorers will be in the 20 per cent who
fail during the freshman year, or that hardly any
scholarship winners will be academic failures. In industrial
selection, a test of moderate validity can be
efficient in quickly screening out the "clearly ineligible"
from the "clearly eligible." There will remain an indifferent
zone of test scores for persons in the "eligible"
range; for them, other considerations than test scores
may determine whether they should be hired.

Let us look at some data. One hundred ninety-one
eighth-grade boys took the Verbal Reasoning Test of
the Differential Aptitude Tests (DAT) battery at
the start of a term. At the end of the term, the grades
they earned in a Social Studies course were obtained.
Seventy-six were found to have earned grades of D
or lower; they represented 40 per cent of the total
class. On the basis of chance (i.e., using a test with
zero validity), we should expect to find that 40 per
cent of those at each test score level (low, medium
or high) obtained grades of D or lower. The coefficient
of correlation between the test scores and
these grades was .61, for which the index of forecasting
efficiency comes out to just 20 per cent better
than chance, hardly enough to notice. Table I
reveals a very different story: it shows the test to
be a highly efficient predictor for the school's purposes!
Instead of 40 per cent of the highest-scoring
pupils being found in the low grades group (as one
would expect by chance), only six per cent are found
there.³

Table I. Chance expectations and actual performances in a social studies
class in relation to DAT Verbal Reasoning scores

DAT Verbal    No. of    % expected by chance    % actually earning
Reasoning     Pupils    to earn D, E, or F      D, E, or F
26-up           19              40                      6
18-25           49              40                     14
10-17           60              40                     36
 2-9            63              40                     73
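As a rough consistency check on Table I, the percentages in each score band can be turned back into approximate head counts and compared with the 40 per cent base rate. This is only a sketch: the counts are recovered from rounded percentages, so they are estimates, and the variable names are illustrative.

```python
# Rows of Table I: (score band, number of pupils, % actually earning D, E, or F)
TABLE_I = [("26-up", 19, 6), ("18-25", 49, 14), ("10-17", 60, 36), ("2-9", 63, 73)]
BASE_RATE = 40  # per cent of the whole class earning D or lower

total_pupils = sum(n for _, n, _ in TABLE_I)
# Approximate number of low grades implied by the rounded percentages.
estimated_low_grades = sum(n * pct / 100 for _, n, pct in TABLE_I)

for band, n, pct in TABLE_I:
    ratio = pct / BASE_RATE
    print(f"{band:>6}: {pct:3d}% low grades, {ratio:.2f} times the chance rate")

print(f"pupils: {total_pupils}, estimated low grades: {estimated_low_grades:.1f}")
```

The implied totals agree with the 191 pupils and roughly 76 low grades reported in the text, while the rate of low grades runs from well under one fifth of the chance figure in the top band to nearly double it in the bottom band.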

Another example, drawn from the area of industrial
testing, is shown in Table II. The Short Employment
Tests (SET) were administered to 74 stenographers
at a single level of job responsibility, and the relationships
between scores on the tests and on-the-job
proficiency ratings were investigated. The girls were
rated as low, average or high in ability; the table
shows, for each of these groups, what per cent were
in each third on the Clerical aptitude test of the SET
battery.

By chance alone, the per cent of upper, middle
and low scorers in each of the rated groups would
be the same: in this case, 33 1/3. The boldface numbers
in the table would consist of nine 33's. Note how
closely this expected per cent is approximated for
those ranked average in proficiency, and for those
in the middle third on test score; the percentages in
the middle row and those in the middle column run
between 28 and 36. Note also that at the extremes
(the four corner numbers) the prediction picture
is more promising. Among those rated low, there are
almost three times as many people from the lowest
third on the test as there are from the top third. Among
those rated high, the per cent from the top third on the
test is almost two and one-half times as great as the
per cent from the bottom third. The personnel man
would do well to be guided by these data in selecting
future stenographers, even though the validity coefficient
is just .38.



Table II. Per cent of stenographers in each third on SET-Clerical
who earned various proficiency ratings

SET-Clerical                 Proficiency Rating
Test Score               Low     Average     High
Upper Third               18        33        50
Middle Third              29        36        28
Lowest Third              53        31        22
Total Per Cent           100       100       100
No. of Stenographers      17        39        18
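The two comparisons drawn from the corner cells of Table II are simple ratios. In the sketch below the upper-third entry for the low-rated group is not read from the scan (where it is illegible) but inferred from the requirement that the column total 100 per cent; that inference and the names used are assumptions.

```python
# Per cent of each rated group falling in each third on SET-Clerical.
LOW_RATED = {"middle": 29, "lowest": 53}
# Inferred entry: the column must add up to 100 per cent.
LOW_RATED["upper"] = 100 - sum(LOW_RATED.values())
HIGH_RATED = {"upper": 50, "middle": 28, "lowest": 22}

# Among those rated low: lowest third versus top third on the test.
low_ratio = LOW_RATED["lowest"] / LOW_RATED["upper"]
# Among those rated high: top third versus bottom third on the test.
high_ratio = HIGH_RATED["upper"] / HIGH_RATED["lowest"]

print(f"rated low:  lowest third / upper third = {low_ratio:.2f}")
print(f"rated high: upper third / lowest third = {high_ratio:.2f}")
```

Both ratios are far from the value of 1 that chance alone would produce, even with a validity coefficient of only .38.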



The data in the above examples are based on relatively
small numbers of cases (which is typically
true of practical test situations) and the per cents
found in each category are consequently somewhat
unstable. The validity coefficients based on groups
of such sizes are, of course, also less stable than coefficients
based on large numbers of cases. The wise
test user will make several validity studies using successive
groups. Having done so, he may take an average
of the validity coefficients from these studies as
being a more dependable estimate of the validity of
the test in his situation. Formal tables are available⁴
which can be used to estimate expectancies when the
validity coefficient is of a given size and the per cent
of successes and failures is known. Table III has
been constructed from these formal tables to illustrate
the usefulness of coefficients of various magnitudes.






Table III. Per cent of successful individuals in each decile on test score

Standing on         When the total per cent       When the total per cent       When the total per cent
the test            of failures is 20%            of failures is 30%            of failures is 50%
Percentile  Decile   r=.30  r=.40  r=.50  r=.60    r=.30  r=.40  r=.50  r=.60    r=.30  r=.40  r=.50  r=.60
90-99th       10     92%    95%    97%    99%      86%    91%    94%    97%     [...]   78%    84%    90%
80-89th        9     89     91     94     97       81     85     89     92       63     67     72     78
70-79th        8     86     89     91     94       78     81     84     88       59     62     65     69
60-69th        7     84     86     88     91       75     77     80     83       55     57     59     61
50-59th        6     82     84     85     87       72     74     75     77       52     52     53     54
40-49th        5     80     81     82     83       70     70     70     71       48     48     47     46
30-39th        4     78     77     77     78       67     66     65     64       45     43     41     39
20-29th        3     75     73     72     71       63     61     59     56       42     38     35     31
10-19th        2     71     68     64     61       59     55     50     45       37     33     28     22
 1-9th         1     63     56     49     40       50     43     35     27       29     23     16     10

[...] marks an entry that is illegible in the source.

The first part of Table III is based on a failure rate
of 20 per cent. It shows the per cent of individuals
at different levels on the test who are successful (in
marks earned, or dollar sales, or merit rating, or number
of widgets assembled, or whatever we are trying
to predict) when the validity coefficient is .30, .40,
.50, or .60. The columns in boldface at the left show
the decile rank on the test: individuals with percentile
ranks of 90 to 99 are in the tenth decile or top
10 per cent, those with percentile ranks from 80 to 89
are in the next (9th) decile, etc.; the first decile includes
the individuals between the first and ninth percentiles
on the test, the 10 per cent who scored lowest.
In the first lightface column is shown the per cent
of persons in each decile who may be expected to
succeed when the validity coefficient (r) is .30; the
second column in lightface type presents similar
expectancy information when r = .40, the next column
is for r = .50, and the last column for a validity
coefficient of .60.

What does this table tell us? Suppose that the failure
rate among Winsocki college freshmen is about
20 per cent, that usually one out of every five students
fails or goes on probation before the end of the year.
A selection test is given and a correlation of .30 is
found between scores on the test and success in the
first year. Ninety-two per cent of those who score in
the top 10 per cent of the group on the test may be
expected to succeed, while only 63 per cent in the
bottom decile can expect to survive the first year. If
the validity coefficient is .40, ninety-five per cent in
the top decile may be expected to survive; of the lowest
scoring students, 56 per cent are likely to be around
at the end of the year. The survival rate when r =
.60 is almost perfect (99 per cent) for the top group;
it is only 40 per cent for the lowest scorers.

The last two sections of Table III present similar
information for coefficients of .30, .40, .50, and .60
when failure rates are 30 per cent and 50 per cent.
The last column at the right shows, for example, that
if only 50 per cent of a total group is successful, and
the validity coefficient is .60, the top scoring individuals
will have a survival rate of 90 per cent; of those
in the bottom decile on the test, only one out of ten
is likely to succeed.
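The Bulletin does not reproduce the formal tables from which Table III was built. As a hedged sketch of how such expectancies can be approximated, the code below assumes (this is an assumption, not something the article states) that test score and criterion follow a bivariate normal distribution with correlation r, and that "failure" means falling below a fixed cut on the criterion. The function name and parameters are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def success_rate_in_band(r: float, failure_rate: float,
                         lo: float, hi: float, steps: int = 2000) -> float:
    """Approximate per cent succeeding among people whose test score lies
    between the lo and hi quantiles (e.g. 0.9 and 1.0 for the top decile),
    under an ASSUMED bivariate normal model with correlation r."""
    y_cut = _STD.inv_cdf(failure_rate)   # criterion cut separating failures
    spread = sqrt(1.0 - r * r)           # SD of criterion given test score
    total = 0.0
    for i in range(steps):
        # Midpoint rule over equally probable slices of the band.
        u = lo + (hi - lo) * (i + 0.5) / steps
        x = _STD.inv_cdf(u)
        total += 1.0 - _STD.cdf((y_cut - r * x) / spread)
    return 100.0 * total / steps


# Top and bottom deciles for r = .30 with a 20 per cent failure rate.
print(round(success_rate_in_band(0.30, 0.20, 0.9, 1.0)))
print(round(success_rate_in_band(0.30, 0.20, 0.0, 0.1)))
```

Under that assumption the sketch comes close to the tabled entries for the cases discussed in the text, which suggests (but does not prove) that the published tables rest on a similar model.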

It is interesting to compare the figures in the column
headed r = .50 (when failures total 20 per cent) with
the quotation with which we began. The "only 13.4
per cent" sort of statement may be (and often has
been) misinterpreted as indicating that the test can
tell us little. Actually, the test has changed our picture
dramatically. Without it, we could say only that for
every person the odds are four chances to one he'll
succeed. With the test, we can sort the candidates into
groups and say that some have distinctly better prospects
than others. If three men score, respectively, in
the tenth, the seventh and the lowest deciles, we can
give odds on their success:

                      Without test     With knowledge
                      information      of test
Man in 10th decile    4 to 1           37 to 1 (97.4%-2.6%)
Man in 7th decile     4 to 1            7 to 1 (88%-12%)
Man in 1st decile     4 to 1            1 to 1 (49%-51%)
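Odds of this kind are just the per cent succeeding divided by the per cent failing. The short sketch below applies that conversion to the success rates for r = .50 with 20 per cent failures (97.4, 88 and 49 per cent for the tenth, seventh and first deciles); the helper name is illustrative.

```python
def odds_to_one(success_pct: float) -> float:
    """Convert a success percentage into odds of success 'to one'."""
    if not 0.0 <= success_pct < 100.0:
        raise ValueError("success_pct must be at least 0 and below 100")
    return success_pct / (100.0 - success_pct)


cases = {
    "no test information": 80.0,
    "10th decile": 97.4,
    "7th decile": 88.0,
    "1st decile": 49.0,
}
for label, pct in cases.items():
    print(f"{label:>20}: about {odds_to_one(pct):.1f} to 1")
```

Rounded to whole numbers, these come out at 4, 37, 7 and 1 to 1 respectively.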






What are the practical implications of these facts?
Most apparent is the real potential utility of validity
coefficients of .60, .50, .40, and even .30; the information
they provide is far from useless. For the counselor,
they offer increased ability to estimate his client's
general chances of success in an educational or vocational
pursuit. For the admissions officer in a college,
better forecasts of drop-out rate, as well as more informed
selection, are possible. For personnel men in
industry, data such as these provide information with
respect to the selection ratios which will be necessary
to obtain a desired number of successful employees.

As do all other statistics, standard errors of estimate 
and validity coefficients require full understanding. 
For all of us, our errors of estimate will always be 
greater than we would like. The precision of our esti- 
mates will be less than perfect, and we shall aim con- 
stantly to increase that precision. At the same time, if 
a test will increase appreciably our ability to predict 
(even though broadly) performance in curricula or 
careers, let us use it — with caution, but also with 
gratitude. A blade not sharp enough for shaving can 
still be used to cut a knot — A. G. W. 

Note: While this Bulletin was in press, another approach to
the topic was published by W. L. Jenkins, "An index of selective
efficiency (S) for evaluating a selection plan." Journal of
Applied Psychology, 1953, Vol. 37, p. 78.



REFERENCES 

¹ J. P. Guilford, Psychometric Methods (New York: McGraw-Hill,
1936), p. 364. The index of forecasting
efficiency = $100\,(1 - \sqrt{1 - r^2})$, where r is the
validity coefficient, the correlation between the predictor
test and the subsequent performance rating or
other criterion. When the number of cases is small,
[...] under the square root sign.

² J. P. Guilford, test 304, page 407 in Buros: The Fourth
Mental Measurements Yearbook (Highland Park,
N. J.: The Gryphon Press, 1953).

³ For semitechnical discussions of everyday ways of demonstrating
test validity, see The Psychological Corporation's
Test Service Bulletins Nos. 37 and 38:
"How Effective Are Your Tests?" and "Expectancy
Tables: A Way of Interpreting Test Validity."

⁴ R. W. B. Jackson and A. J. Phillips, "Prediction Efficiencies
by Deciles for Various Degrees of Relationship."
Educational Research Series No. 11, Dept. of
Educational Research, Ontario College of Education,
University of Toronto. Especially interested persons
may find it worthwhile to see also: H. C. Taylor and
J. T. Russell, "The relationship of validity coefficients
to practical effectiveness of tests in selection." Journal
of Applied Psychology, 1939, Vol. 23, pp. 565-578.



Long awaited ... sorely needed ... indispensable ...

THE FOURTH MENTAL MEASUREMENTS
YEARBOOK

Oscar K. Buros, Editor, with 308 reviewers

This is the latest in the series of the Buros Mental
Measurements Yearbooks, which have become one
of the most important references in the test field. 
Teachers, counselors, clinicians, personnel men — all 
serious test users, in fact — have found these volumes 
unique in their wealth of evaluative information and 
in the exhaustive reference listings they contain. Here 
are expert reviews of achievement tests, aptitude tests, 
individual and group intelligence tests, interest inven- 
tories, and measures of character and personality; 
tests of reading and tests of etiquette; of Latin and 
Greek, and health and home economics; of hearing, 
of manual dexterity and of aptitude for law school. 
Many of the reviews are sharply critical. Appropriate 
alarms are sounded for the weak points of many a 
test and test manual. 

As summarized in its prospectus, "The Fourth
Yearbook, a large volume of 1,189 two-column pages,
consists entirely of new material and supplements
rather than supplants previous yearbooks. [It] covers
the period 1948 through 1951. The section 'Tests
and Reviews' lists 793 tests, 596 reviews by 308
reviewers, 53 excerpts from test reviews in 15 journals,
and 4,417 references on the construction, validity,
use and limitations of specific tests. The section
'Books and Reviews' lists 429 books on measurement
and closely related fields, and 758 excerpts from book
reviews in 121 journals."

Those who reviewed The Third Mental Measurements
Yearbook used such adjectives as "monumental,"
"indispensable," "invaluable," "comprehensive." We
have no doubt that The Fourth Yearbook will receive
equal acclaim. No school, clinic or personnel office
can afford to be without it. Its value to test users
many times exceeds its cost.



The Third Mental Measurements Yearbook (1948),
covering the years 1940-47, is still available. Its 713 reviews
of tests and other contents are not duplicated in The Fourth
Yearbook, and many tests reviewed in it have not been listed
again in the new edition. 1047 pages.



NOTE: In 1959 the Fifth, in 1965 the Sixth, and in 1972 the Seventh Mental Measurements
Yearbook (2 vols.) were published. One who can afford only one of the Yearbooks
should, of course, have the latest: both volumes of the Seventh. The Sixth, Fifth,
Fourth, and Third Yearbooks are still available. Even the 1938 and the 1940 Yearbooks,
out of print for years, are now available again, on special order. See Catalog for






Test Service Bulletin 



No. 46 



THE PSYCHOLOGICAL CORPORATION 



January, 1954 



Published from time to time in the interest of promoting greater understanding of the principles and techniques 
of mental measurement and its applications in guidance, personnel work, and clinical psychology, and for 
announcing new publications of the Test Division. Address communications to 522 Fifth Avenue, New York 36. 



Harold G. Seashore, Editor                     Jerome E. Doppelt
Director of the Test Division                  Marjorie Gelink
Alexander G. Wesman                            James H. Ricks, Jr.
Associate Director of the Test Division        Assistant Directors



THE CORRECTION FOR GUESSING 






When Pat and Mike laid down their picks and shovels and decided to apply for the job of mechanic's helper,
they realized they would be competing with each other. Only one job was available. They consequently
were not surprised when the personnel director asked them to take a test of mechanical comprehension to
help the company decide which man would be selected.

Pat, a cautious man, carefully read the directions for the test, learned he was to choose the best answer to
every question from the three choices that were given, and proceeded to take the test. He found he was quite
sure of his answers to 36 of the 50 questions; for the remaining questions he could sometimes rule out one of
the choices but he just could not select one answer with complete confidence. Pat felt it would be best not to try.

Mike, on the other hand, was generally more willing than Pat to take a chance. After he answered the 23
questions with which he had no difficulty, he decided to answer the remaining questions as best he could. As
luck and partial information would have it, Mike managed to answer correctly 13 of the 27 items about which
he had had doubts.

The results of the scoring of the test papers were 36 rights and no wrongs for Pat; 36 rights and 14 wrongs
for Mike.

The test-maker had realized that people will react differently when faced with multiple-choice test questions
which they cannot answer with confidence. Some will not respond to such questions; others will risk answering
them. Consequently the test score was defined as the number of correct answers minus one-half the number of
wrong responses. Thus Pat's score was 36 and Mike's score was 29.



In this instance, the correction for guessing resulted
in a higher score for Pat than for Mike. Since we know
how the two men took the test, it seems entirely fair
for Pat to receive the higher rating. But how often do
we know what has gone on in the minds of the
examinees?



All people who have scored or used multiple-choice
tests know that there exist several "formulas" for obtaining
scores. We find among objective and semi-objective
tests such different scoring formulas as the
number of right answers, the number of right answers
minus the number of wrong responses, Rights minus
1/2 Wrongs, Rights minus 1/3 Wrongs, Rights minus 1/4
Wrongs, and the like. Psychometricians can usually
tell, after a quick glance at the test content, what the
scoring formula will be. If the test is of the completion
type, the formula is the number of right answers; if
the test is of the multiple-choice type, the number of
right answers is often reduced by an amount equal to
the number of wrong responses divided by one less
than the number of options per item.




The scoring formula for a particular test is determined
by applying the laws of chance in an attempt
to correct for the effect of guessing on the part of the
examinee. In a test made up of five-choice items, for
example, the examinee may be expected to guess
correctly the answer to one out of every five items he
doesn't know or can't solve. The total number of right
answers therefore includes the number of items
answered correctly on the basis of information plus
the number of correct guesses. But how does one know
how many answers are guesses? To determine the
number of correct guesses we make use of the number
of wrong responses. Thus, for every four wrong responses
it is assumed the examinee made one correct
guess. Consequently, the number of correct guesses
is estimated for a five-choice test by dividing the number
of wrong responses by four. This is the correction
for guessing and it is subtracted from the number of
right answers.* Note that the basic assumption is that
all wrong responses plus some of the right ones are
classified as chance responses or guesses.
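The correction just described amounts to one line of arithmetic: rights minus wrongs divided by one less than the number of choices. A minimal sketch follows; the function name is illustrative, and the last example (10 rights, 80 wrongs) is a hypothetical case added to show that negative scores can occur.

```python
def corrected_score(rights: int, wrongs: int, choices: int) -> float:
    """Number right minus the estimated number of lucky guesses,
    where omitted items count neither as right nor as wrong."""
    if choices < 2:
        raise ValueError("an item needs at least two choices")
    return rights - wrongs / (choices - 1)


# Pat: 36 rights and no wrongs on a three-choice test.
print(corrected_score(36, 0, 3))
# An examinee who tries all 100 five-choice items and gets 20 right.
print(corrected_score(20, 80, 5))
# Hypothetical: 10 rights and 80 wrongs on five-choice items.
print(corrected_score(10, 80, 5))
```

With no wrong responses the score is unchanged, a "chance level" record on a five-choice test is reduced to zero, and a record below chance level goes negative.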

Usually a "guess" is interpreted as a positive statement
or action based on chance. An omission or the
withholding of a response is ordinarily not considered
a guess. It is interesting, therefore, to find that in any
given group, how much difference the so-called correction
for guessing makes depends on the number of
omissions rather than on the number of actual guesses
(which we never know). If everyone in a group taking
a test answers all the items, the uncorrected scores
(the number of right answers) will be perfectly correlated
with corrected scores which take into account
the number of wrong responses. The numerical values
of the corrected and uncorrected scores will of course
be different but the relative positions or ranks of individuals
in the group will be exactly the same. This
can be demonstrated mathematically and will be true
regardless of whether the wrong answers are due to
chance responses, misinformation or partially correct
information. The same situation exists if all students
have the same number of omitted items, even though
the specific omitted items differ from student to student.
It is only when the number of omissions ranges
from very few to very many that a correction factor
assumes significance.

* The formula is $R - \frac{W}{N - 1}$, where R is the number of right answers,
W is the number of wrong responses and N is the number of
choices per item. It is sometimes remarked that the larger the
number of possible answers to a question, the smaller the
importance of the correction formula. On a two-choice or
true-false item, chance answers will be right fifty per cent of
the time; for a five-choice item, the probability is only twenty
per cent; and for a sixteen-choice, such as appears in the
DAT Verbal Reasoning Test, one can quite safely ignore the
role of chance.
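The point about omissions can be seen numerically. When every examinee answers all n items, wrongs equal n minus rights, so the corrected score is a straight-line function of the number right and the rank order cannot change. The sketch below uses made-up examinees (A to F) and a hypothetical 50-item, five-choice test purely for illustration.

```python
def corrected_score(rights: int, wrongs: int, choices: int) -> float:
    """Rights minus wrongs divided by one less than the number of choices."""
    return rights - wrongs / (choices - 1)


def rank_order(scores: dict) -> list:
    """Names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


ITEMS, CHOICES = 50, 5  # hypothetical test

# Case 1: everyone answers every item, so wrongs = ITEMS - rights.
rights_all_answered = {"A": 41, "B": 33, "C": 27, "D": 18}
raw_order = rank_order(rights_all_answered)
corrected_order = rank_order(
    {name: corrected_score(r, ITEMS - r, CHOICES)
     for name, r in rights_all_answered.items()}
)

# Case 2: omissions differ. E answers everything; F omits 20 items.
mixed = {"E": (30, 20), "F": (28, 2)}  # (rights, wrongs)
raw_mixed = rank_order({name: rw[0] for name, rw in mixed.items()})
corrected_mixed = rank_order(
    {name: corrected_score(rw[0], rw[1], CHOICES) for name, rw in mixed.items()}
)

print(raw_order, corrected_order)
print(raw_mixed, corrected_mixed)
```

In the first case the two orderings are identical; in the second the cautious examinee F overtakes E once the correction is applied, which is the only kind of situation in which the formula changes anyone's standing.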

What does the "corrected" score mean? Is it an in- 
dication of the number of items to which the examinee 
definitely knows the answers because the number of 
correct guesses has been subtracted? After some con- 
sideration, one can see that the corrected score does 
not actually mean this. For the correction formula to 
be strictly applicable, the examinee must have made 
pure chance responses to all the items which he 
marked incorrectly and to some of the items which he 
marked correctly. For that to occur, all of the options 
for an item to which the examinee responds by chance 
must seem to him equally likely to be right. Ordinarily,
if the examinee is even half awake when he is taking 
the test, all the options will not be equally attractive. 
He can probably rule out some of the options quite 
readily. It is also obvious that influences other than 
chance enter into the picture. It is entirely possible 
that an examinee answers an item incorrectly because 
he has definite misinformation on the topic or because 
he has partial information which misleads him. In 
such cases, he did not really guess at the answer, in a 
chance sense. Since the examinee rarely chooses purely 
by chance among the possible answers presented to 
him, the basic assumption underlying the correction 
for guessing is violated. In some instances the correc- 
tion for guessing may overcorrect and in other in- 
stances, it may undercorrect. In general, the correction 
for guessing probably yields a reasonable approximation
of the true situation not because of the inherent
soundness of its assumptions but rather because it 
tends to be a compromise between too much correc- 
tion and not enough correction. 

If the correction for guessing is based on conditions 
which are practically never met, some terms and con- 
cepts regarding the meaning of test scores should be 
altered. For example, it is not uncommon to hear that 
a particular student or applicant got no more than a 
"chance score" on a test if he answered twenty items 
correctly out of a total of 100 five-choice items. It is 
felt that such a score is no more than the effect of 
chance, and the correction for guessing if he tried 
every item will reduce this score to zero. It is no more 
accurate to say that this student got a "chance score" 
than it is to say that a pair of loaded dice respond to 
the laws of chance. There is probably no such thing as 
a chance score on a test appropriate to the person 
and the situation unless the examinee is blindfolded 
when he takes the test. Zero scores and negative 
scores, which sometimes result from a correction for 
guessing, are not indications of no knowledge whatso- 
ever regarding the materials in the test. Such scores 
are probably obtained through the interaction of (a) 
positive correct information on some items, (b) guessing 
and partial information and positive misinformation 
on other items, and (c) overcorrection for 
guessing. 
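
The arithmetic behind the "chance score" can be sketched as follows. The sketch assumes the conventional formula, rights minus wrongs divided by one less than the number of options; the function name and the second paper are illustrative and are not taken from any published scoring key.

```python
from fractions import Fraction

def corrected_score(right, wrong, options):
    """Rights minus wrongs/(options - 1); omitted items do not enter (assumed formula)."""
    return Fraction(right) - Fraction(wrong, options - 1)

# The case discussed above: 20 right and 80 wrong on 100 five-choice items.
chance_case = corrected_score(right=20, wrong=80, options=5)

# A hypothetical, slightly weaker paper falls below zero.
negative_case = corrected_score(right=15, wrong=85, options=5)

print(chance_case)    # 0
print(negative_case)  # -25/4
```

The zero and the negative value are products of the formula alone; neither shows that the examinee knew nothing about the material.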

The correction for guessing is widely used in scoring 
power tests and there are some situations in which its 
use is advisable. Some students are bold, and answer 
questions when they are not sure of the answers while 
their more timid colleagues would rather omit those 
questions. If the test score is simply the number of 
correct responses, there will be a premium for bold- 
ness. In such instances, it seems reasonable to correct 
the scores by subtracting a proportion of the number 
of wrong responses. It would, however, be more logical 
to call the correction factor a penalty for wrong 
responses than to call it a correction for guessing. 
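
A small sketch may make the premium for boldness concrete. It assumes two hypothetical examinees who each know 60 of 100 five-choice items, and it treats the bold examinee's remaining answers as pure chance, which is the very assumption questioned earlier; the figures are expected values, not observed results.

```python
from fractions import Fraction

OPTIONS = 5   # five-choice items (illustrative)
ITEMS = 100
KNOWN = 60    # items each examinee really knows (illustrative)

def corrected_score(right, wrong, options=OPTIONS):
    """Rights minus wrongs/(options - 1) (assumed conventional formula)."""
    return Fraction(right) - Fraction(wrong, options - 1)

unknown = ITEMS - KNOWN

# The timid examinee omits every item he is unsure of.
timid_right, timid_wrong = KNOWN, 0

# The bold examinee marks them all; by pure chance he expects one in five right.
lucky = Fraction(unknown, OPTIONS)
bold_right, bold_wrong = KNOWN + lucky, unknown - lucky

print(timid_right, bold_right)                     # 60 68
print(corrected_score(timid_right, timid_wrong))   # 60
print(corrected_score(bold_right, bold_wrong))     # 60
```

On number right alone the bold examinee gains eight points for the same knowledge; the penalty for wrong responses removes that advantage on the average.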

When an item is omitted in an untimed test we can 
generally assume that the examinee had the oppor- 
tunity to read the question but, for one reason or an- 
other, refused to respond. Speed tests present a some- 
what different problem. True speed tests are made up 
of questions which are extremely easy and the examinee 
will almost always answer correctly if he has 
the opportunity to read the item. Most of the omis- 
sions in a pure speed test are due to the fact that the 
examinee never got a chance to answer the items be- 
cause the time was up before he could reach them. In 
tests of this type we usually find very few or no omis- 
sions and relatively few wrong answers between the 
first item and the last item attempted. Consequently 
there is no need for using a correction scheme. The 
number of right answers is entirely adequate as a 
score. 

Many tests may best be described as a mixture of 
power and speed. In such tests speed is an important 
factor, but the items vary in difficulty and are gener- 
ally arranged in order of difficulty. Between the first 
and the last items attempted the examinee is very 
likely to encounter questions which he cannot answer 
with certainty and he must then decide whether or 
not he will risk a guess. There may be considerable 
variation in the number of omissions up to the last 
item attempted and in the number of wrong responses. 
If this is found to be the case, the situation is simi- 
lar to that found in power tests. A corrected score 
may then be advisable. There are, however, great 
differences among tests of the mixed power and speed 
type in the extent to which items are omitted or 
answered incorrectly. The actual effectiveness of a 
correction applied to the number of right answers must 
be evaluated for each test separately. 

Before leaving the topic of speeded tests we should 
note a situation which occasionally arises. Many 
speeded tests are scored by merely recording the num- 
ber of right answers. A test-wise examinee who knows 
how the test is scored may answer the questions to 
the best of his ability until shortly before the time is 
up. He may then hastily record an answer to each of 
the remaining items without even stopping to read the 
questions. He is thus almost bound to pick up some 
points of score without any danger of incurring a 
penalty. If this kind of test-taking occurs with any 
frequency, it would be advisable to apply a correction 
to the scores. 
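
What the test-wise examinee stands to gain can be estimated roughly as below. Marking unread items is one of the few cases in which the answers really are pure chance, so the sketch assumes each blind mark has one chance in the number of options of being right; the function name and the figure of 20 unread items are illustrative only.

```python
from fractions import Fraction

def expected_gain(unread_items, options, penalize_wrong):
    """Expected change in score from marking unread items blindly."""
    p = Fraction(1, options)
    expected_right = unread_items * p
    expected_wrong = unread_items * (1 - p)
    if penalize_wrong:
        # Assumed conventional penalty: wrongs divided by (options - 1).
        return expected_right - expected_wrong / (options - 1)
    return expected_right

rights_only = expected_gain(20, 5, penalize_wrong=False)
with_penalty = expected_gain(20, 5, penalize_wrong=True)

print(rights_only)   # 4
print(with_penalty)  # 0
```

With rights-only scoring the last-minute marking is worth about four points on the average; with the penalty applied it is worth nothing on the average.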

It is the fond hope of those who construct power 
tests that the examinees will answer all the items. If 
this were to happen, there would be no need for 
correcting scores. We know, however, that there are 
differences among examinees in their willingness to 
leave items unanswered. It is likely that more people 
might be induced to answer more items if the direc- 
tions for the test stated that omissions would be 
counted as wrong responses or if all were encouraged 
to guess whenever they did not know the answers. 
Such directions would doubtless disturb those educa- 
tors who feel that encouraging students to guess 
makes for loose thinking and disrespectful attitudes 
toward learning. This view may have validity if tests 
are being used for moralistic or character-building 
purposes alone, but this is rare. Usually a test's essential 
function is that of a measuring instrument and as 
such it should be kept as uncontaminated as possible. 
One source of contamination is the matter of boldness 
vs. caution in taking the test. The imposition of a penalty 
for wrong answers is an attempt to control this 
type of contamination. More effective control would 
be achieved if all students were encouraged to be 
equally bold, so to speak, by answering every item. 

What can be said, in summary, about the correction 
for guessing? In the first place, "correction for guessing" 
is essentially a misnomer; the correction could 
more properly be called a penalty for answering 
wrong. Some of these wrong answers may have been 
given in accordance with the laws of chance but more 
of them probably are based on misinformation or par- 
tial information. 






Second, the basic assumption underlying the correction 
for guessing is the concept of the "chance score." 
Thus one expects a proportion of the number of items 
to be answered correctly on the basis of chance. This 
concept is misleading and may make it appear that 
the examinee knows the answers to fewer questions 
than he really does know. 

Third, the correction for guessing makes a difference 
in the relative positions of individuals in a group if 
there is considerable variation in the number of 
omitted items. To eliminate a premium for willingness 
to answer items, it seems advisable to use corrected 
scores. It should be remembered, however, that such 
corrected scores are an attempt to rule out the effects 
of differential boldness in taking the test rather than 
a method for getting a true picture of the examinee's 
knowledge. 

Fourth, when a comparison of the corrected and un- 
corrected scores on a power test shows considerable 
discrepancies in relative standing of the examinees, 
the question is not which type of score should be used. 
The question is whether or not the test is really a 
power test and whether it is appropriate for use with 
the group. 
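
The third point above can be checked with a few made-up papers. The sketch assumes a 50-item five-choice test and the conventional penalty of wrongs divided by one less than the number of options; the examinees and their scores are hypothetical.

```python
def corrected_score(right, wrong, options=5):
    """Rights minus wrongs/(options - 1) (assumed conventional formula)."""
    return right - wrong / (options - 1)

def ranking(scores):
    """Names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical papers as (right, wrong); the remainder of the 50 items are omitted.
everyone_answers_all = {"A": (40, 10), "B": (35, 15), "C": (30, 20)}
omissions_vary = {"A": (34, 16), "B": (33, 2), "C": (30, 20)}

def compare(papers):
    """Return the rank order by number right and by corrected score."""
    raw = {name: r for name, (r, w) in papers.items()}
    corrected = {name: corrected_score(r, w) for name, (r, w) in papers.items()}
    return ranking(raw), ranking(corrected)

print(compare(everyone_answers_all))  # (['A', 'B', 'C'], ['A', 'B', 'C'])
print(compare(omissions_vary))        # (['A', 'B', 'C'], ['B', 'A', 'C'])
```

When no one omits, the corrected score rises and falls with the number right and the order cannot change; only when omissions differ, as for examinee B here, do the two kinds of score rank people differently.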

The fundamental purpose in giving a test is to obtain 
samples of behavior which will permit comparisons 
with respect to some reasonably well defined 
attribute among individuals in the group tested. Effective 
discrimination among the examinees must be 
demonstrated for each test in specific situations. There 
is no reason to believe that any scoring formula con- 
tributes materially to a test's discriminating power. 

Many published tests require use of a correction 
formula to obtain the score. The user of such tests 
must necessarily abide by the scoring instructions 
since otherwise he cannot compare his scores with 
the norms. In making his own objective tests, the 
teacher or personnel man need not feel that a correc- 
tion for guessing is essential to the construction of a 
good test. Reliability and validity may still be obtained 
with either corrected or uncorrected scores. —J.E.D. 




